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Multi-species distribution modeling, which relates the occurrence 
of multiple species to environmental variables, is an important tool 
used by ecologists for both predicting the distribution of species in a 
community and identifying the important variables driving species co¬ 
occurrences. Recently, Dunstan, Foster and Darnell [Ecol. Model. 222 
(2011) 955-963] proposed using finite mixture of regression (FMR) 
models for multi-species distribution modeling, where species are 
clustered based on their environmental response to form a small num¬ 
ber of “archetypal responses.” As an illustrative example, they ap¬ 
plied their mixture model approach to a presence-absence data set of 
200 marine organisms, collected along the Great Barrier Reef in Aus¬ 
tralia. Little attention, however, was given to the problem of model 
selection—since the archetypes (mixture components) may depend on 
different but likely overlapping sets of covariates, a method is needed 
for performing variable selection on all components simultaneously. 

In this article, we consider using penalized likelihood functions for 
variable selection in FMR models. We propose two penalties which 
exploit the grouped structure of the covariates, that is, each covari¬ 
ate is represented by a group of coefhcients, one for each component. 

This leads to an attractive form of shrinkage that allows a covariate to 
be removed from all components simultaneously. Both penalties are 
shown to possess specific forms of variable selection consistency, with 
simulations indicating they outperform other methods which do not 
take into account the grouped structure. When applied to the Great 
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Barrier Reef data set, penalized FMR models offer more insight into 
the important variables driving species co-occurrence in the marine 
community (compared to previous results where no model selection 
was conducted), while offering a computationally stable method of 
modeling complex species-environment relationships (through regu¬ 
larization) . 

1. Introduction. Multi-species distribution modeling, which relates the 
occurrence of multiple species to environmental covariates, is an important 
tool both for predicting how a species community will respond to changing 
environmental conditions and for identifying important environmental vari¬ 
ables driving species co-occurrences [Ferrier and Guisan (2006), Ovaskainen, 
Hottola and Siitonen (2010), Pollock et al. (2014)]. To construct such mod¬ 
els, statistical methods are required which can handle the underlying het¬ 
erogeneity in species-environment relationships (i.e., different species in the 
same community can have very different environmental responses), while 
providing accurate predictions for rare species that may not be modeled re¬ 
liably on their own [by borrowing strength across organisms in a community, 
Ferrier and Guisan (2006), Ovaskainen and Soininen (2011)]. 

Recently, Dunstan, Foster and Darnell (2011) proposed using finite mix¬ 
ture of regression [FMR, Wedel and DeSarbo (1995)] models for multi¬ 
species distribution modeling of presence-absence (binary) data, where spe¬ 
cies are clustered based on their environment response to form a small num¬ 
ber of “archetypal responses.” The methodology was extended by Hui et al. 
(2013) and Dunstan et al. (2013) to handle count and biomass data, the lat¬ 
ter being a nonnegative continuous value representing the combined weight 
of all individuals of each species. By clustering species into archetypes, and 
modeling each archetype using a generalized linear model [GLM, McGullagh 
and Nelder (1989)], these Species Archetype Models or SAMs offer a power¬ 
ful approach to modeling heterogeneity in a community’s response to a set of 
covariates. Glustering species based on environmental response is also con¬ 
sistent with recent findings in ecology that groups of species tend to respond 
in a similar manner to environmental gradients [at least with the resolution 
of most multi-species data sets, Glark (2010)]. Moreover, Hui et al. (2013) 
showed SAMs offer strong predictive performance of rare species by borrow¬ 
ing strength from more prevalent species classified to the same archetype. 

While these initial results for SAMs showed promise, little attention was 
given to the important issue of model selection. In their application of SAMs 
to a data set of presence-absence records collected for 200 species along the 
Great Barrier Reef off the northeast coast of Australia, Dunstan, Foster and 
Darnell (2011) used a heuristic version of BIG [Schwarz (1978)] to select 
the “types of covariates” to enter in the model, that is, physical habitat 
covariates, oceanographic measures or both. Also, both Hui et al. (2013) 
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and Dunstan et al. (2013) a priori fixed the set of covariates to enter into 
their respective SAMs. In all three articles, the same set of covariates were 
entered into each archetype. This is a restrictive requirement, as it fails to 
account for the numerous (and unknown) ways which organisms react to 
their environment. 

Since the component densities may depend on different but likely overlap¬ 
ping sets of covariates, a method is needed for performing variable selection 
on all components in an FMR model simultaneously. For SAMs especially, 
simultaneously performing variable selection over all archetypes is key to 
identifying which environmental variables are important in structuring the 
species community. On the other hand, given the number of candidate mod¬ 
els is considerably larger than in the standard GLM context, methods which 
require fitting all possible models, for example, information criteria, are im¬ 
practical. 

In this article, we consider using penalized likelihood methods for vari¬ 
able selection in FMR models and SAMs. Since each covariate in an FMR 
model is represented by a group of coefficients, one for each component (and 
whose true value may be zero), we propose two penalties which exploit this 
grouped structure. This leads to an attractive form of shrinkage that al¬ 
lows a covariate to be removed from all components simultaneously. The 
first penalty proposed is a modification of the group LASSO [Yuan and Lin 
(2006)] to FMR models, called MIXGL2 (since it is an ^ 2 -norm penalty), 
which is applied across components on a per covariate basis. The second 
penalty, called MIXGLl, is based on the square root of the £i-norm and 
allows the component densities to depend on different sets of covariates. 
In a diverging number of covariates settings (i.e., the number of parame¬ 
ters grows a slower rate than sample size), we demonstrate that MIXGLl 
and MIXGL2 each possess a specific form of variable selection consistency. 
Furthermore, simulation studies show MIXGL2 and MIXGLl outperform 
other penalties which do not take into account the grouped structure of the 
covariates, both in variable selection and prediction. 

We apply both penalties to construct multi-species distribution models for 
the aforementioned Great Barrier Reef data set. This data set was collected 
as part of a larger biodiversity project aimed at identifying the key envi¬ 
ronmental variables important in structuring seabed biodiversity, as well as 
predicting future distributions of species communities along the Great Bar¬ 
rier Reef [Pitcher et al. (2007)]. It consists of presence-absence data of 200 
species collected at 1189 sites, along with 13 environmental covariates. Gom- 
pared to the results from unpenalized SAMs [Dunstan, Foster and Darnell 
(2011), Hui et al. (2013)], the penalized versions offer two advantages: (1) 
clearer insight is gained into species co-occurrence, as the penalties provide 
an automated way of identifying the variables informing each archetypal re¬ 
sponse; (2) the inclusion of penalties smooths the likelihood and leads to a 
more stable estimation procedure for SAMs. 
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We conclude this introduction by reviewing previous literature on penal¬ 
ized likelihood methods for FMR models. Khalili and Chen (2007) were the 
first to extend the LASSO [Tibshirani (1996)] and SCAD [Fan and Li (2001)] 
penalties to FMR models by applying these penalties on a per component 
basis, that is, each component has its own tuning parameter. For FMR mod¬ 
els with a diverging number of covariates in each component, Khalili and 
Lin (2013) proposed elastic-net type penalties [i.e., a linear combination of 
a sparsity-inducing and a ridge penalty, Zou and Hastie (2005)], which were 
also applied on a per component basis. For mixtures of linear regression in 
particular. Stabler, Biihlmann and van de Geer (2010) applied the adaptive 
LASSO [Zou (2006)] with one tuning parameter for the entire model. To 
date, no penalty has been proposed which exploits the grouped structure of 
the covariates in FMR models, something we investigate in this article. 

2. Finite mixture of regression models. Consider a sample of observa¬ 
tions {{xi,yi);i = 1,... ,n}, where tji is a univariate, independent and iden¬ 
tically distributed response and Xj is a p x 1 vector of covariates. We allow 
p to grow polynomially with sample size, that is, p'^ jn —)• 0 for some v > 1. 
The precise value of v is specified later on in Section 3.1. All covariates 
are assumed to have been standardized to mean zero and variance one. For 
an FMR model with K components, the conditional density function for 
observation i is given as follows: 

K p 

(1) h{yi-,Xi,^) = '^7rkf{yi;:xi,pik,4>k)] gj^J-ik) = /3ofc + xg^ki, 

k=l 1=1 

where tt = (tti, ..., vr;^) denotes the mixing proportions satisfying > 0, 
Ylk=i^k = 1 and /(y;X,/ifc,(/)fc) is the kth component density assumed to 
come from the exponential family with mean and dispersion parameter 
4>k- For observation i, the mean conditional on belonging to the A:th compo¬ 
nent, pik, is regressed against covariates Xj using a GLM with link function 
y(-) and coefficients {/3ki'J = 1,. ■. ,p}- 

Let Pi = ... ,Pki) be the vector of coefficients corresponding to co¬ 

variate 1. Notice the coefficients are concatenated on a per covariate basis— 
this reflects the grouped structure of the covariates, that is, each covariate 
is represented by K coefficients, one for each component and whose true 
value may be zero. Finally, let P = {Pi,... ,Pp) be the K x p matrix of re¬ 
gression coefficients, and 'I' = {Pi,..., Pp, Pm,..., Pok,P,'^) denote all the 
parameters in the FMR model, where cf) = {pi ,..., px)- 

In this article, it is assumed the parameters in the FMR model in equa¬ 
tion (1) are generically identifiable up to a permutation of the component 
labels [see condition (Al) in the Supplementary Material, Hui, Warton and 
Foster (2015b)]. Furthermore, we develop our asymptotic theory assuming 
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the number of components K is known [analogous to Stadler, Biihlmann 
and van de Geer (2010), Khalili and Lin (2013)], although for our appli¬ 
cation with SAMs we propose a BIC-type criterion to select the number 
of archetypes. General discussions regarding parameter identifiability for 
mixture models can be found in McLachlan and Peel (2004) and Friihwirth- 
Schnatter (2006), with the specific case of generic identifiability discussed in 
Follmann and Lambert (1991), Hennig (2000) and Griin and Leisch (2008), 
among others. 


3. New penalties for variable selection. To exploit the grouped structure 
of covariates in FMR models and SAMs, we consider penalized likelihood 
methods using penalties which are applied across components on a per co¬ 
variate basis, 

i=i 

where ln{^) = ELi Xj, Mifc, 0fc)) is the observed log-like¬ 

lihood and V{(3i) denotes a penalty function which is nonnegative and sat¬ 
isfies V{0) = 0. Let (3 = {(3i,... ,/3p) denote the unpenalized maximum like¬ 
lihood estimates of /3. We propose two penalty forms for V{f3i). The hrst is 
a modification of the group LASSO [Yuan and Lin (2006)] for FMR models. 


Definition 1. For the FMR model defined in equation (1), the M1XGL2 
estimates are given by maximizing the penalized log-likelihood function 




/ j A / ^ 

1 = 1 \ k = l 


K 




where wi = {Ylk=i ^li) 7 > 0. 


M1XGL2 possesses the group sparsity property, that is, it is nondifferen- 
tiable when f3ii = ■ ■ ■ = 13 ki = 0 for covariate I. This is an attractive property 
to have for variable selection in FMR models, as it encourages a covariate to 
be removed from all components simultaneously. Such a form of sparsity in 
the solution is useful for multi-species distribution modeling, as often there 
are numerous environmental covariates which are completely uninformative 
for all archetypes (a covariate is defined as completely uninformative if all its 
coefficients are truly zero). MIXGL2 is useful for screening these covariates 
out, potentially as a first stage in variable selection for SAMs. 

The second penalty we propose is based on the square root of a weighted 
^i-norm. 
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Definition 2. For the FMR model defined in equation (1), the MIXGLl 
estimates are given by maximizing the penalized log-likelihood function 




l=l \ k=l 


K 




where Wki = \Pki\ and 7 > 0 . 


MIXGLl not only possesses the group sparsity property like MIXGL2, 
it also possesses individual coefficient sparsity analogous to the adaptive 
LASSO, that is, it is also nondifferentiable for all individual coefficients 
j5ki- This individual sparsity allows MIXGLl to remove covariates from 
only K' < K components. It is therefore well suited to species distribu¬ 
tion modeling—since the archetypal responses typically depend on different 
sets of covariates, MIXGLl can accommodate for differing mean structures 
in each archetype. Put another way, the set of variables that drive the co¬ 
occurrences of one group of species are usually slightly different to those 
that drive the co-occurrence of another group. The form of MIXGLl allows 
for this, while maintaining the ability to remove completely uninformative 
covariates from the entire SAM. Of course, the choice of which penalty also 
depends on the analysis objectives—sometimes, it is of interest to see which 
covariates affect any (or all) components, in which case MIXGL2 is more 
appropriate. Other times, we may want to know how each covariate affects 
the archetypes in the most compact way, in which case MIXGLl is suitable. 

Definition 2 is a special case of the Gomposite Absolute Penalty (CAP) 
family of penalties [Zhao, Rocha and Yu (2009)], although our work is the 
first to apply such a penalty to the FMR model context. Both MIXGL2 
and MIXGLl incorporate data-dependent weights based on the unpenalized 
estimates, /3, with the severity of these weights controlled by 7 . The inclusion 
of weights builds on the idea of the adaptive LASSO and allows the penalized 
estimates to achieve desirable large sample properties as discussed in the 
next section. Finally, unlike the penalties in Khalili and Chen (2007) and 
Khalili and Lin (2013) which are applied on a per component basis, MIXGLl 
and MIXGL2 do not depend on the mixing proportions. When penalization 
occurs on a per component basis, having penalties which are a function of 
TT makes sense since it relates the severity of penalization to the “effective 
sample size” of each component. For penalization across components on a 
per covariate basis, specifically, the MIXGLl and MIXGL2 penalties, the 
need to incorporate mixing proportions is less obvious. 


3.1. Asymptotic properties. In this section, we consider the large sample 
behavior of the MIXGL2 and MIXGLl estimators. As mentioned previously. 
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we assume the number of components K is fixed and known a priori, but 
the number of covariates in each component grows with sample size n. We 
shall use pn, as well as a subscript n in other quantities, for example, Xn 
and /3„, to reflect this. Let .. ■, I3n,p^, be 

parameter values corresponding to the true model, which is assumed to be 
identifiable. We can partition all the regression coefficients in the true model 
as follows: 

Definition 3. The regression coefficients in the true model, {(3^ i, ■. ■, 
(3^ri,pn) can be partitioned into the following mutually exclusive sets: 

• An = {(A:, 1) ■ 0} denotes the set of truly nonzero coefficients. 

• Bn = {{k,l): = 0,\\P^\\2 A denotes the set of zero coefficients be¬ 

longing to partly uninformative covariates. 

• Cn = {(A;, 0 ■ l^nki ~ b, ||/3j lb = 0} is the set of zero coefficients belonging 
to completely uninformative covariates. 

As formalized below, the group sparsity property of MIXGL2 allows it to 
asymptotically set all coefficients belonging to set Cn to zero, while the com¬ 
bined group and individual coefficient sparsity property of MIXGLl allows 
it to asymptotically set all coefficients belonging to sets Bn and Cn to zero. 

For both penalties, assume the following regularity conditions are satis¬ 
fied: 

(A) XnO-n — Opiri ! ), 

(A') AnOn = 

(B) plliXlbn) = Op{n), 

(C) Pnln^O, 

(C') pl/n^O, 

where for the MIXGL2 penalty, an = m.a-x{wn,i',l G An} and bn = min{u;^;; 
I G Cn}, and, analogously for the MIXGLl penalty, an = raax{wki', {k,l) G 
An} and bn = min{u;fcj; (k, 1) G Bn^Cn}- Gonditions (A) and (A') ensure the 
existence of penalized likelihood estimates which are asymptotically unbi¬ 
ased, while condition (B) ensures an appropriate degree of shrinkage. The 
rate of growth of the number of covariates in conditions (C) and (G^ is 
the same as Khalili and Lin (2013), and we believe it to be appropriate in 
many applications of species distribution modeling in ecology, that is, the 
number of environmental variables recorded is usually small compared to 
the number of sites visited. 

We hrst consider the asymptotic behavior of the MIXGL2 estimator: 

Theorem 1 (Oracle property—MIXGL2). Assume conditions (A)-(C) 
hold. Then there exists a local maximizer ^n of in^^i'^n) in Definition 1 
which satisfies the following: 


F. K. C. HUI, D. I. WARTON AND S. D. FOSTER 


• Estimation consistency: = Op{y^pnln). 

• Covariate selection consistency: = 0) —)• 1. 

• Asymptotic normality: If conditions (A') and (C) are also satisfied, then 

where = 0n,c^, fin,oi, ■ ■ ■ ,fin, 0 K,^n,^n), Tn is aqx \C^\ matrix such 

that r„r^—>-PG, and is the Fisher information matrix knowing 


All proofs have been relegated to the Supplementary Material [Hui, Warton 
and Foster (2015b)]. Theorem 1 states MIXGL2 is covariate selection con¬ 
sistent, that is, it will asymptotically remove completely uninformative co¬ 
variates from the FMR model. On the other hand, if the true model contains 
partly uninformative covariates {Bn 0), MIXGL2 will in the large sample 
limit retain these covariates in all components. This makes sense because 
MIXGL2 does not possess individual coefficient sparsity. By contrast, if we 
consider the asymptotic behavior of the MIXGLl estimator, then we have 
the following: 

Theorem 2 (Oracle property—MIXGLl). Assume conditions (A)-(G) 
hold. Then there exists a local maximizer ^n of in^'^{'itn) in Definition 2 
which satisfies the following: 

• Estimation consistency: = Op{^JpfifiJn). 

• Coefficient selection consistency: P{(3n BnUCn =0) —)■ 1. 

• Asymptotic normality: If conditions (A') and (C) are also satisfied, then 

V^rnIn{KAS^\'^n,An ” J 4 iV(0, G), 

where ’^n,An = {fin,An^ fiwOi, ■ ■ ■, fin, 0 K,(j>n^^n), || ' || denotes the i 2 -norm, 
r„ is a qx \ An\ matrix such that r^r^j^^G, and In{^^An) Fisher 

information matrix knowing An- 

Theorem 2 states MIXGLl is coefficient selection consistent, that is, it will 
asymptotically remove completely uninformative covariates and zero coef¬ 
ficients belonging to partly uninformative covariates from the FMR model. 
This is a stronger form of selection consistency compared to MIXGL2, and 
is a desirable property in terms of identifying the truly important covariates 
facilitating species co-occurrence. 

To compute the MIXGL2 and MIXGLl estimates, we use an estimation 
procedure combining the Expectation Maximization [EM, Dempster, Laird 
and Rubin (1977)] algorithm with a local quadratic approximation [LQA, 
Ean and Li (2001)]. Details regarding this estimation procedure, and a proof 
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that it possesses the desired ascent property, may be found in the Supple¬ 
mentary Material [Hui, Warton and Foster (2015b)]. To select the tuning 
parameters (An,7}, we use a BIC-type criterion [see Zhang, Li and Tsai 
( 2010 ) for the use of information criteria in selecting tuning parameters], 

= - 24 (T'„) -L log(n) dim(^„), 

where dim(/3„) denotes the number of nonzero estimates in /3^. Note 
and dim(/3„) both depend on and 7 , for example, as An increases, more 
values of 0n shrunk to zero. For our simulations and applications, we 
select from 7 G {0.5,1,2}. 

4. Application. We apply MIXGLl and MIXGL2 to Species Archetype 
Models and the Great Barrier Reef data set introduced in Section 1. To 
recap, the data set consists of presence-absence data of 200 species at 
1189 sites, along with 13 environmental covariates. Five of these covari¬ 
ates were descriptors of physical habitat: depth (BATHY), bottom stress 
(BSTRESS), percent gravel (GRAVEL), percent mud (MUD), percent car¬ 
bonate (GARBON), and the other eight covariates were oceanographic mea¬ 
sures: mean and intra-annual standard deviation of temperature (T.AV, 
T.SD), mean and intra-annual standard deviation of oxygen concentration 
(02. AV, 02.SD), mean and intra-annual standard deviation of salinity (S.AV, 
S.SD), mean and intra-annual standard deviation of K490 (K490.AV, 
K490.SD). K490 is measure of turbidity based on light penetration and is 
related to the presence of light scattering particles in the water. 

To model the 200 species separately using a GLM (say) would be a difficult 
task, especially since 164 out of 200 species are present at less than 5% of the 
sites. Such sparsity in the response (in the sense that most species are rarely 
observed) is characteristic of multi-species data and motivates methods such 
as SAMs which are able to borrow strength across species. On the other 
hand, while the applications of SAMs in Dunstan, Foster and Darnell (2011) 
and Dunstan et al. (2013) were mainly for illustrative purposes, the goal in 
this article is to perform (consistent) variable selection in order to identify 
the key drivers of species co-location. 

Let Y denote the multi-species response matrix collected i = 1,... ,n sites 
for j = 1 ,..., s species, with [Y]jj = 1 if species j was found at site i and 0 
otherwise. For the Great Barrier Reef data set, n = 1189 and s = 200. Let Xj 
be the vector of environmental covariates at site i. SAMs are an extension 
of FMR models in (1) to the case of product Bernoulli component densities: 

K / n 

/i(yj;xi, '4^) = ^vTfc ( ^/^^{.(l - 
k=i \j=i 
p 

logit (/r^-fc) = (3oj + ^ XiiPki- 

1=1 


(2) 
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Note in (2), the intercepts Pqj are species-specific instead of component- 
specific. As discussed in Dunstan et al. (2013), this is so that species are 
clustered solely on the shape of their environmental responses and not as 
well on the prevalence. 

For each of the 13 covariates, we fitted linear and quadratic terms, result¬ 
ing in 26 terms available for selection in each archetype. We used MIXGL2 
and MIXGLl to perform variable selection, with the tuning parameters cho¬ 
sen using BIGa „,7 defined in Section 3.1, and log(n) replaced by log(s) as 
the model complexity penalty. The reason for this replacement is because 
the fundamental observational unit in SAMs is a species instead of a site [see 
Dunstan, Foster and Darnell (2011) and also the Discussion in Section 6]. For 
a fixed K, the combined EM plus LQA algorithm used for fitting penalized 
FMR models can be straightforwardly extended to the case of SAMs [see the 
Supplementary Material for details, Hui, Warton and Foster (2015b)]. To se¬ 
lect the number of archetypes, we considered a candidate range K = [1,20] 
and used another BIG-type criterion [see Sections 6.8-6.9, McLachlan and 
Peel (2004), on using information criteria for selecting K\, 

BICk = - 24 ('i'(i^)„) +log(s)dim(4'(i^)„), 

where dim(\E'(A'))„ denotes the number of nonzero parameters in the SAM 
with K archetypes. The model selection procedure thus proceeds as follows: 
for each candidate K, the best penalized SAM is selected using BIGa,^,.^. 
Afterward, the final model is selected from these “best penalized SAMs” 
using BlCi^. 

Figure 1 shows a plot of coefficients estimates for the two penalized SAMs, 
with exact values of the coefficients provided in the Supplementary Mate¬ 
rial [Hui, Warton and Foster (2015b)]. Using BIC;^, MIXGL2 and MIXGLl 
produced SAMs with K = 9 and 10 archetypes, respectively [note Dunstan, 
Foster and Darnell (2011) chose K = 11 archetypes when applying unpe¬ 
nalized SAMs to the same data set]. Compared with the unpenalized SAM 
in Dunstan, Foster and Darnell (2011), penalized variable selection offers 
much more insight into the drivers of seabed biodiversity. Both models in¬ 
dicated the five physical habitat covariates were informative for almost all 
archetypes (Figure 1— bottom 10 rows in both plots), meaning these vari¬ 
ables are important in structuring the entire marine species assemblage. In 
contrast, MIXGLl and MIXGL2 deemed several of the oceanographic mea¬ 
sures to be either completely or close to completely uninformative, for exam¬ 
ple, the quadratic terms T.SD^ and S.SD^. It is also in these oceanographic 
covariates where the archetypal responses were largely differentiated for the 
MIXGLl model. For example, archetype 2 was the only archetype where 
the four terms corresponding to mean and intra-annual standard deviation 
of oxygen were informative. That is, the co-occurrence of species classified 
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Fig. 1. Map of coefficients for penalized SAMs with MIXGL2 (left) and MIXGLl penal¬ 
ties (right). Empty circles indicate coefficients shrunk to exactly 0. Based on BIGk, the 
former chose K — 9 archetypes while the latter chose K = 10 archetypes. 


to this archetype could be partly attributed to their shared environmental 
response to oxygen. Also, archetypes 2 to 6 all had coefficients for mean 
annual temperature which were substantially different from zero. However, 
species in archetypes 3 and 6 tend to occupy regions of higher temperatures, 
while species classified to archetype 2, 4 and 5 tend to occur in relatively 
cooler regions. 

For prediction purposes, maps were constructed for each archetype show¬ 
ing how the probability of presence on the linear predictor scale varies spa¬ 
tially along the entire Great Barrier Reef. That is, for archetype fe = 1,..., A', 
these were constructed using the linear predictor. 


Vik 


Tj=l ^jkPoj 

Ylj=l ^jk 


P 

l=l 


which were then mapped across all sites in the Barrier Reef region. To clarify, 
we chose to map the linear predictors fuk directly rather than convert them 
to probabilities, as this generally makes it easier to identify any differences 
between the archetypes. Note the intercept used in the predicted maps is a 
weighted average of all species-specific intercepts, with weights proportional 
to the posterior probabilities of belonging to archetype k. These maps are 
provided in the Supplementary Material [Hui, Warton and Foster (2015b)]. 
Note that maps could also be constructed for each species, although for 
managerial purposes maps constructed on a per archetype basis tend to be 
more useful, since managing an archetype is equivalent to simultaneously 
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manaffine all the species classified into the archetype [Dunstan, Foster and 
Darnell (2011)]. 

Both MIXGL2 and MIXGLl produced some similar maps, for example, 
archetype 4 in both models exhibited a high probability of presence from 
the central to the southeast region of the reef toward the Coral sea, with 
the probability decreasing sharply toward land. Also, species in archetype 
9 for both models tend to be found with relatively high probability toward 
the southeast Queensland coast. Given the differing properties of MIXGL2 
and MIXGLl, however, it was not surprising to also observe some notable 
differences between the two sets of maps, for example, archetype 2 for the 
MIXGL2 fit displayed relatively high probabilities of presence in a small 
region on the southeast region of the Barrier Reef, where percent gravel 
was rather high. However, no corresponding archetype was observed for the 
MIXGLl fit—while archetypes 4 and 7 also had positive linear terms for 
percent gravel (Figure 1— right), the environmental response of species in 
these two components tends to be driven more by other covariates. This 
suggests incorporating a penalty such as MIXGLl, which allows different 
mean structures in each archetype and leads to more precise differentiation 
of the sources of species co-occurrences. 

5. Simulation study. A small simulation was performed to assess the 
finite sample performance of the MIXGLl and MIXGL2 in FMR models in 
equation (1). The number of covariates in each component was determined 
as Pn = — 5, where [•] is the ceiling function [see also Khalili and 

Lin (2013)]. Govariates {x^; i = 1,..., n} were generated from a standard 
multivariate Gaussian distribution with correlation structure Cor{xir,Xis) = 
Responses were then simulated from a K = 2 binomial FMR model 
with trial size 10, using the four models below: 

(/3oi,/3i) = ( 1, 0.7,2,-2, 1.5, 0, 0, 0, ...), 

Model I: (/3 o2 ,/92) = (-0.5, 2, 0, 0, 0, 1, -2,0.5,...), 

Model II: (/3 o2 ,/32) = (-0.5, 2, 0, 0, 1, -2,0.5, 0, ...), 

Model III: (/3 o2 ,/92) = (-0.5, 2, 0, 1, -2,0.5, 0, 0, ...), 

Model IV: (/3 o2 ,/92) = (-0.5, 2, 1,-2, 0.5, 0, 0, 0, ...), 

where “...” indicates extra zeros. The models are designed such that, as we 
move from models I to IV, the number of partly uninformative covariates 
decreases. We considered combinations of tti = 0.5,0.7 and n = 100,200,400, 
with the latter corresponding to pn = 7,9,12 covariates (excluding intercept) 
in each component, respectively. 500 data sets were generated for each com¬ 
bination. We assumed K = 2 was known in advance. 

We compared MIXGLl and MIXGL2 to three penalties proposed pre¬ 
viously for FMR models: adaptive LASSO [ADL, Stadler, Biihlmann and 
van de Geer (2010)], MIXLASSO-^ 2 ! MIXSCAD -^2 [Khalili and Lin 
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(2013)]. The latter two penalties are linear combinations of the ridge and 
LASSO (SCAD) penalty. ADL penalizes coefficients separately, while 
MIXLASSO -^2 and MIXSCAD -^2 penalize on a per component basis. Per¬ 
formance was assessed using mean sensitivity (proportion of true nonzeros 
estimated to be nonzero) and specihcity (proportion of true zeros estimated 
to be zero), and predicted log-likelihood. The latter was calculated using 
an independent test data set of n = 10,000 observations, with higher values 
implying better predictions. Note that to remove variation across the differ¬ 
ent data sets, we centered the predicted log-likelihood values by subtracting 
the average obtained across the different methods in each data set. Also, to 
deal with the problem of label-switching prior to calculating sensitivity and 
specihcity [see Section 4.9, McLachlan and Peel (2004)], we permuted the 
estimated coefficients so as to minimize the ^ 2 -iiorm between the estimated 
and true coefficients. For brevity, we only present results for tti = 0.5. Sim¬ 
ilar outcomes were observed for tti = 0.7, and these results are provided in 
the Supplementary Material [Hui, Warton and Foster (2015b)]. 

MIXGLl performed consistently well, with sensitivity and specihcity lar¬ 
gely unaffected by the number of completely uninformative covariates (Ta¬ 
ble 1). Compared to MIXGLl, the three methods which did not penal¬ 
ize on a per-covariate basis had lower specihcity indicative of overhtting. 
The result makes sense given the group sparsity property of MIXGLl and 
MIXGL2—for all four models, there was a proportion of coefficients which 
always belonged to completely uninformative covariates. Therefore, penal¬ 
ties which can remove a covariate from all components simultaneously were 


Table 1 

Mean sensitivity/specificity for various sample sizes and tti = 0.5. Transitioning from 
models I to IV, the proportion of completely uninformative covariates increases 


Sensitivity / Specificity 

n Model MIXGLl MIXGL 2 ADL MIXLASSO-^2 MIXSCAD-^2 


100 I 

II 

III 

IV 

200 I 

II 

III 

IV 

400 I 

II 

III 

IV 


0.962/0.948 

0.962/0.962 

0.966/0.972 

0.957/0.980 

0.983/0.986 

0.986/0.983 

0.985/0.985 

0.985/0.989 

0.998/0.992 

0.999/0.991 

0.999/0.995 

0.999/0.992 


0.947/0.040 

0.959/0.373 

0.957/0.747 

1/0.970 

0.960/0.320 

0.954/0.651 

0.956/0.864 

1/1 

0.979/0.636 

0.978/0.761 

0.981/0.884 

1/1 


0.958/0.897 

0.956/0.855 

0.950/0.832 

0.965/0.825 

0.994/0.948 

0.987/0.935 

0.994/0.911 

0.987/0.908 

0.999/0.961 

0.999/0.958 

0.999/0.940 

0.997/0.942 


0.991/0.170 

0.992/0.128 

0.989/0.160 

0.981/0.183 

0.998/0.388 

0.999/0.430 

0.999/0.415 

0.995/0.392 

1/0.546 

1/0.576 

1/0.561 

1/0.543 


0.978/0.690 

0.975/0.683 

0.970/0.647 

0.970/0.635 

0.996/0.826 

0.993/0.819 

0.995/0.803 

0.991/0.782 

1/0.918 

0.999/0.901 

0.999/0.897 

0.998/0.895 
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n = 100 n = 200 n = 400 



Fig. 2. Predicted log-likelihood (centered) as a function of a simulation model for 
TTi =0.5. The methods shown are the following: MIXGLl (Q), MIXGL2 (A), ADL (-{-), 
MIXLASSO-I 2 ), MIXSGAD -(.2 (O)- As we move from models I to IV, the proportion 
of completely uninformative covariates increases. Note the different scales on the y-axis 
for the three figures. 

much better at shrinking these coefficients to zero. As the number of partly 
uninformative covariates decreased, the performance of MIXGL2 dramati¬ 
cally improved, especially in specificity. Another interesting trend observed 
when Transitioning from models I to IV, in this simulation at least, was a 
slight but noticeable decline in specificity for ADL and MIXSCAD -.£2 (in¬ 
dicating overfitting). Finally, MIXLASSO -.^2 had the highest sensitivity in 
most models, but significantly lower specificity than all the other methods 
tested, which suggested a substantial amount of overfitting. 

The trends in sensitivity and specificity (Table 1) were similarly observed 
in predictive performance. MIXGLl predicted the best overall, while as we 
move from models I to IV and decrease the number of partly uninforma¬ 
tive covariates, predictions from MIXGL2 improved significantly (Figure 2). 
There is also a slight decreasing trend in predictive log-likelihood for ADL 
and MIXSCAD-^ 2 ; which appeared to coincide with a slight drop in speci¬ 
ficity. In the Supplementary Material [Hui, Warton and Foster (2015b)], we 
present an additional simulation conducted with mixtures of linear regres¬ 
sion, AT = 4, and more covariates, with similar results to those observed 
above despite the larger number of components. 

6. Discussion. Species in a community exhibit significant heterogeneity 
in their occurrence patterns. A major source of this heterogeneity is due 
to species responses being driven by different but potentially overlapping 
sets of environmental covariates. In the context of multi-species distribution 
modeling using SAMs, this means we require a method of variable selec¬ 
tion which can consistently identify the covariates responsible for shaping 
each archetypal response (and thus shaping species co-occurrences within 
each archetype). In this article, we proposed two penalties, MIXGL2 and 
MIXGLl, which exploit the grouped structure of covariates in FMR models 
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and SAMs. By penalizing across components on a per covariate basis, these 
penalties can remove a covariate simultaneonsly from all components of an 
FMR model. Both penalties were shown to posses specific forms of selec¬ 
tion consistency, with simulations indicating they outperform other penal¬ 
ties which do not take into account the grouped structure of covariates. 
Applying penalized SAMs to the Great Barrier Reef data set offered clearer 
insight into the variables structuring seabed biodiversity, while providing a 
relatively simple idea of how the species assemblage as a whole responds to 
the environment. 

While we have demonstrated that penalized SAMs can help to unravel 
how the distribution of a species community depends on environmental co¬ 
variates, imposing a penalty on the likelihood also offers computational ad¬ 
vantages. Given the high dimensionality {s/n being a nonnegligible ratio) 
and heterogeneity in environmental responses of multi-species data sets, the 
likelihood for a SAM is expected to be “bumpy” with numerous local max¬ 
ima. The estimates obtained from applying the EM algorithm to an unpe¬ 
nalized SAM may therefore depend heavily on the starting point, and may 
correspond to a local instead of the (one of K\ equivalent) global maxi¬ 
mum. Adding a penalty to the likelihood can help to resolve this problem 
by smoothing the likelihood surface and making the global maxima more 
apparent. 

As an illustration of this, we considered two models fitted to the Great 
Barrier Reef data set: (1) the MIXGLl penalized SAM with A = 10 and the 
tuning parameter fixed at it the final value used in Section 4; (2) an unpenal¬ 
ized SAM, that is, A = 0, with A = 10 and the 26 covariate terms included 
in each archetype. Each model was fitted 50 times using the same estimation 
procedure as in Section 4, each time using a random starting point gener¬ 
ated by simulated posterior probabilities for each species from a Dirichlet 
distribution with hyperparameters all set to 1. Figure 3 shows a comparative 
boxplot of the resulting log-likelihood values. Importantly, the variability of 
the log-likelihood values for the penalized SAM was smaller compared to 
the unpenalized SAM (ratio of variance = 1.510; A-test p-value < 0.01). The 
reduction in variability can be attributed to the MIXGLl penalty smoothing 
the SAM likelihood surface, removing some of the “bumps and small hills” 
and leading to a more stable estimation algorithm. 

In future work, we hope to extend penalized SAMs to multi-species pre¬ 
sence-only data, particularly given the commonality of such data and re¬ 
cently shown equivalences between point process models and Poisson/Logistic 
GLMs [Warton and Shepherd (2010), Fithian and Hastie (2013)]. Whether 
the consistency and oracle properties of MIXGLl and MIXGL2 hold in 
this context should be considered. Also, other penalties which exploit the 
grouped structure of the covariates should be considered [e.g., the sparse 
group LASSO, Simon et al. (2013)]. In our application to the Barrier Reef 
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Log-likelihood values for SAMs 
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Fig. 3. Comparative boxplot of the log-likelihood values from 50 fits of a penalized SAM 
with the MIXGLl penalty (left) and an unpenalized SAM (right). Both models fitted the 
same set of covariates and the same number of archetypes. 

data set, MIXGL2 and MIXGLl did not take into account the hierarchi¬ 
cal structure of polynomials, for example, the linear term for intra-annual 
standard deviation of K490 (K490.SD) was removed while the quadratic 
term remained in the model. While this still makes sense ecologically (i.e., 
given all covariates were centered, then the values of K490.SD where species 
are most likely to be found was around the average value observed in the 
data set), a penalty which explicitly obeys this hierarchical principle would 
be preferred, such as the fused lasso [Tibshirani et al. (2005)]. Finally, the 
validity of BIG or any other information criterion for choosing the tuning 
parameter in MIXGLl and MIXGL2 remains an open question. In particu¬ 
lar, whether the model complexity penalty for SAMs should be modified to 
log(s) (as was done in this article to reflect the fundamental observational 
unit being a species), remain as log(n) or perhaps be something else [see, 
e.g., Hui, Warton and Foster (2015a)] warrants further investigation. 

Acknowledgments. Thanks to the Associate Editor, two anonymous re¬ 
viewers and Bill Venables for useful discussions. 

SUPPLEMENTARY MATERIAL 

Supplement to “Multi-species distribution modeling using penalized mix¬ 
ture of regressions” (DOL 10.1214/15-AOAS813SUPP; .pdf). Material in¬ 
cludes technical proofs, details on estimation procedure and additional sim¬ 
ulation and application results. 
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